Abstract. A standard model for exposing structured provenance metadata
of scientific assertions on the Semantic Web would increase interoperability,
discoverability, reliability, and reproducibility for scientific
discourse and evidence-based knowledge discovery. Several Resource
Description Framework (RDF) models have been proposed to track provenance.
However, provenance metadata may be not only verbose but
also significantly redundant. Therefore, an appropriate RDF provenance
model should be efficient for publishing, querying, and reasoning over
Linked Data. In the present work, we collected millions of pairwise
relations between chemicals, genes, and diseases from multiple data
sources, and demonstrated the extent of redundancy of provenance
information in the life science domain. We also evaluated the suitability of
several RDF provenance models for this crowdsourced data set, including
the N-ary model, the Singleton Property model, and the Nanopublication
model. We examined query performance against three commonly used
large RDF stores: Virtuoso, Stardog, and Blazegraph. Our experiments
demonstrate that query performance depends on both the RDF
store and the RDF provenance model.


1 Introduction 

Evidence and provenance are key aspects of a healthy scientific discourse. A
standard model to provide structured and interoperable metadata linked to
scientific assertions is of increasing interest [ms]. The Resource Description
Framework (RDF), the lingua franca for the Semantic Web, offers the building
blocks by which statements can be provided along with their metadata.
Structured metadata, such as whether the resource was manually curated or
automatically text mined from scientific literature, is key to assessing the
quality of information. Hence, a scalable and well-designed RDF-based metadata
model is crucial for knowledge integration.

Specifying the provenance of a single entity can be easily achieved using
existing RDF terminologies such as PROV. However, the specification of the
provenance of a binary or n-ary relation remains non-standard. Several
models for exposing the provenance metadata of relations have been
proposed, which add provenance annotations to i) an instance of a class that
represents the n-ary relation (N-ary model) [5]; ii) an instantiated property,
i.e. the Singleton Property (SP) model [TH]; or iii) a graph that contains the
relational assertions, i.e. the Nanopublication model [12]. In the life
sciences, the N-ary model has been used to capture the provenance information
for protein-protein interactions (e.g. the iRefIndex database [20]) and
text-mined gene-disease interactions (e.g. DisGeNET 0), while the recently
proposed SP model [TH] has been used across elements of biomedical and
material sciences. Although these models have been used to represent various
data, no study has yet examined the advantages and disadvantages of all of
them using a common dataset.

In the present study, we aim to evaluate the consequences of using different
RDF models to capture provenance metadata for life science data. We examine
the number of triples generated and the query performance on three RDF stores:
Virtuoso m, Stardog |S], and Blazegraph [T]. As provenance metadata of the
relational assertions, we consider the data source, the supporting
scientific publication, and the biological species in which the given assertion
holds true. In addition to the three basic RDF models described above, we also
examine implementations, on top of the N-ary and SP models, of the so-called
cardinal assertion model that was first introduced for Nanopublications |S],
which creates a non-redundant network of assertions. This consideration is
particularly important because there is substantial overlap in the assertions
from multiple databases. For instance, the asserted relation between
dexamethasone (PubChem Compound 5743) and glucocorticoid receptor (GR) (NCBI
Gene 2908) was mentioned by four different data sources, but each data source
cites an entirely different set of scientific publications in support of the
assertion. This work is crucial for the efficient implementation of scalable,
interoperable, and extensible knowledge models for open data sources including
PubChemRDF SO], Bio2RDF [7], and DisGeNET RDF [TS].

2 Methods 

2.1 Dataset preparation 

We generated a reference dataset of pairwise relations between chemicals, genes,
and diseases from multiple data sources across the life science domain. The
chemical-disease relations were obtained from the National Drug File Reference
Terminology (NDFRT) [5], CTD [5], KEGG [T3j, and SIDER |TS|; chemical-gene
relations were obtained from CTD, DrugBank [T3], KEGG [T3|, IUPHAR-DB [23], and
ChEMBL [ni; protein-protein relations were obtained from iRefIndex [20] and
BioGRID [24]; gene-disease relations were obtained from DisGeNET [^. All
chemicals were represented using PubChem Compound identifiers (CIDs), all genes
were represented using National Center for Biotechnology Information (NCBI)
Gene identifiers (GIDs), and all diseases were represented using the Unified
Medical Language System (UMLS) Concept Unique Identifiers. The pairwise
relations were normalized using the modified Semantic Network standard
vocabulary |2Ij. The interrelations between biomedical entities (chemicals,
genes, and diseases) constitute a semantic network, and SPARQL queries were
used to explore the network topology for evidence-based hypothesis generation.
However, it is fairly common to collect the identical assertion from multiple
sources, particularly in such a consolidated knowledge base. Hence, additional
constraints were applied in the search strategies.


2.2 RDF model construction 

Five RDF models were studied: the N-ary model with and without cardinal
assertion (Fig. 1), the SP model with and without cardinal assertion (Fig. 2),
and the Nanopublication model (Fig. 3). Only the assertion graphs and the
provenance graphs were considered in the Nanopublication model. In both the
N-ary and SP cardinal assertion variants, the predicate
cito:providesAssertionFor is used to link the cardinal assertion of the
pairwise relation to its multiple pieces of evidence (Figs. 1a and 2a). Without
cardinal assertion, the pairwise relation is asserted redundantly by multiple
data sources (Figs. 1b and 2b). In the Nanopublication model variant A, one
assertion graph may correspond to one or more provenance graphs (Fig. 3). In
the following comparative analysis, Model I refers to the N-ary model with
cardinal assertion, Model II refers to the N-ary model without cardinal
assertion, Model III refers to the SP model with cardinal assertion, Model IV
refers to the SP model without cardinal assertion, and Model V refers to the
Nanopublication model.
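
To make the structural differences between the five variants concrete, the
following minimal sketch encodes a single relation with its data sources as
plain Python tuples. Only sio:has-agent, sio:has-target, and
cito:providesAssertionFor are taken from the text, and rdf:singletonPropertyOf
is the property used by the Singleton Property proposal; every "ex:" name, the
direction of the cito link, and the exact set of tuples per model are
assumptions made for illustration. The counts it prints describe this toy
encoding only and do not reproduce the dataset sizes reported in this paper.

```python
# Toy encodings of one relation plus provenance as plain tuples.
# Only sio:has-agent, sio:has-target and cito:providesAssertionFor come from
# the text; every "ex:" name is a hypothetical placeholder.
CHEM, GENE, REL = "ex:CID5743", "ex:GID2908", "ex:interacts_with"


def nary(sources, cardinal):
    """N-ary model: the relation is a node with agent and target triples."""
    def relation_node(name):
        return [(name, "rdf:type", REL),
                (name, "sio:has-agent", CHEM),
                (name, "sio:has-target", GENE)]

    if cardinal:
        # Model I shape: one shared assertion node linked to each evidence node.
        triples = relation_node("ex:assertion")
        for i, src in enumerate(sources):
            triples += [("ex:assertion", "cito:providesAssertionFor", f"ex:evidence{i}"),
                        (f"ex:evidence{i}", "ex:source", src)]
        return triples
    # Model II shape: the whole relation is restated once per source.
    triples = []
    for i, src in enumerate(sources):
        triples += relation_node(f"ex:assertion{i}")
        triples.append((f"ex:assertion{i}", "ex:source", src))
    return triples


def singleton(sources, cardinal):
    """SP model: the relation stays one triple whose predicate is unique."""
    if cardinal:
        # Model III shape: one cardinal singleton property linked to evidence.
        triples = [(CHEM, "ex:rel_cardinal", GENE),
                   ("ex:rel_cardinal", "rdf:singletonPropertyOf", REL)]
        for i, src in enumerate(sources):
            triples += [("ex:rel_cardinal", "cito:providesAssertionFor", f"ex:evidence{i}"),
                        (f"ex:evidence{i}", "ex:source", src)]
        return triples
    # Model IV shape: one singleton property per source.
    triples = []
    for i, src in enumerate(sources):
        triples += [(CHEM, f"ex:rel_{i}", GENE),
                    (f"ex:rel_{i}", "rdf:singletonPropertyOf", REL),
                    (f"ex:rel_{i}", "ex:source", src)]
    return triples


def nanopublication(sources):
    """Model V shape: quads; provenance graphs describe the assertion graph."""
    quads = [(CHEM, REL, GENE, "ex:assertionGraph")]
    for i, src in enumerate(sources):
        quads.append(("ex:assertionGraph", "ex:source", src, f"ex:provenanceGraph{i}"))
    return quads


if __name__ == "__main__":
    for n in (1, 2, 4):
        srcs = [f"ex:source{i}" for i in range(n)]
        print(n, len(nary(srcs, True)), len(nary(srcs, False)),
              len(singleton(srcs, True)), len(singleton(srcs, False)),
              len(nanopublication(srcs)))
```

In this toy encoding the cardinal variants need more tuples than their
counterparts when a relation has a single source, and fewer once several
sources repeat the same relation.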


2.3 Query formulation 

Fig. 1. Graphical representation of the N-ary model for the relation between
compound CID5743 and gene 2908: (a) with cardinal assertion (Model I); (b)
without cardinal assertion (Model II).

An interesting research topic in drug discovery is to determine which proteins
are responsible for eliciting particular drug side effects. We formulated SPARQL
queries to examine this question using different levels of complexity (Q1, Q2)
and provenance constraints (Q3, Q4). Q1 explores the hypothesis that if
chemical A inhibits gene B, gene B interacts with gene C, and gene C is linked
to disease D, then this path can be used to explain the disease or adverse side
effect D caused by chemical A. It should be noted that the observed side effect
can be explained in several ways: either by the aforementioned three-step
indirect path, or by the two-step indirect path involving only the
chemical-gene interaction and the gene-disease association. Therefore, we
constructed another query, Q2, to filter out the diseases that are associated
with genes that directly interact with the given chemical. The first two
queries do not take the provenance metadata into account, and it is usually
the case that only the integrated assertions are considered for hypothesis
generation and knowledge discovery. Q3 narrows down the search results by
applying data source constraints. Q4 restricts Q1 by the amount of aggregated
evidence, so that the query only considers pairwise relations in the indirect
path that have more than one supporting literature reference.
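
The logic of the four queries can be sketched over a toy edge list. This is
only an illustration in Python, not the SPARQL used in the study: all entity
names, sources, and publications below are hypothetical, interactions are
treated as directed for simplicity, and the reading of Q3 as "every edge on the
path must have evidence from the given source" is an assumption.

```python
# Each edge maps to its evidence: a list of (data source, publication) pairs.
# All identifiers are hypothetical placeholders.
INHIBITS = {("chemA", "geneB"): [("src1", "pub1"), ("src2", "pub2")]}
INTERACTS = {("geneB", "geneC"): [("src1", "pub3"), ("src1", "pub4")]}
ASSOCIATED = {
    ("geneC", "diseaseD"): [("src2", "pub5"), ("src2", "pub6")],
    ("geneC", "diseaseE"): [("src1", "pub7")],
    ("geneB", "diseaseE"): [("src1", "pub8")],
}


def accept_all(evidence):
    """Default filter: ignore provenance entirely."""
    return True


def step(edges, start, keep):
    """Follow every edge leaving `start` whose evidence passes `keep`."""
    return {obj for (subj, obj), ev in edges.items() if subj == start and keep(ev)}


def q1(chemical, keep=accept_all):
    """Diseases reached via chemical -> gene B -> gene C -> disease."""
    found = set()
    for gene_b in step(INHIBITS, chemical, keep):
        for gene_c in step(INTERACTS, gene_b, keep):
            found |= step(ASSOCIATED, gene_c, keep)
    return found


def q2(chemical):
    """Q1 minus diseases already linked to a directly targeted gene."""
    direct = set()
    for gene_b in step(INHIBITS, chemical, accept_all):
        direct |= step(ASSOCIATED, gene_b, accept_all)
    return q1(chemical) - direct


def q3(chemical, source):
    """Q1 restricted to edges that have evidence from one data source."""
    return q1(chemical, keep=lambda ev: any(s == source for s, _ in ev))


def q4(chemical):
    """Q1 restricted to edges supported by more than one publication."""
    return q1(chemical, keep=lambda ev: len({pub for _, pub in ev}) > 1)


if __name__ == "__main__":
    print(sorted(q1("chemA")), sorted(q2("chemA")),
          sorted(q3("chemA", "src1")), sorted(q4("chemA")))
```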

We carried out Q1 through Q4 on six chemicals that have extensive biomedical
annotations from multiple data sources: propranolol (CID4946), clotrimazole
(CID2812), mitoxantrone (CID4212), risperidone (CID5073), chlorpromazine
(CID2726), and haloperidol (CID3559). There are hundreds of similar compounds
in the integrated dataset, and they are of key interest in the context of drug
repurposing and development.

Fig. 2. Graphical representation of the SP model for the relation between
compound CID5743 and gene 2908: (a) with cardinal assertion (Model III); (b)
without cardinal assertion (Model IV).

All queries were performed against three RDF stores without further tuning:
open source Virtuoso 7.1, Stardog 2.2, and Blazegraph 1.5. The configuration
allowed up to 16 GB of memory for each RDF store to run the queries, which were
performed with a cold cache. The log10-transformed execution times (in
milliseconds) were plotted as boxplots; the averages and standard deviations
of the execution times (in seconds) are also summarized in the comparative
analysis.
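
The two summaries described above can be sketched as follows. The function name
and the sample timings are hypothetical, and the use of the sample standard
deviation is an assumption, since the text does not say which estimator was
used.

```python
import math
import statistics


def summarize(times_ms):
    """Summarize repeated timings of one query, given in milliseconds."""
    seconds = [t / 1000.0 for t in times_ms]
    return {
        # Values that would feed a boxplot on a log10 scale.
        "log10_ms": [math.log10(t) for t in times_ms],
        # Values of the kind reported in a mean (±sd) table, in seconds.
        "mean_s": statistics.mean(seconds),
        "sd_s": statistics.stdev(seconds),  # sample standard deviation (assumed)
    }


if __name__ == "__main__":
    runs_ms = [100.0, 1000.0, 10000.0]  # hypothetical timings
    print(summarize(runs_ms))
```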

The data sets and the SPARQL queries are available at:
http://figshare.com/articles/Provenance_RDF_Models/1399197.


3 Results and Discussions 

3.1 Data set statistics 

Fig. 3. Graphical representation of the Nanopublication model (Model V) for the
relation between compound CID5743 and gene 2908.

We first compared the total number of triples that each RDF model contains. The
most compact RDF model is the SP model without cardinal assertion (Model IV),
which contains 17,239,427 triples; adding the cardinal assertion to the SP
model (Model III) increased the total number of triples by about 14% to
19,575,298. For the N-ary models, the cardinal assertion increased the total
number of triples by about 6%, from 21,445,348 (Model II) to 22,787,218
(Model I). The N-ary model requires two triples (predicates sio:has-agent and
sio:has-target) to represent the agent and target in a biological process,
while the SP model maintains the original binary relation structure in only one
triple. Hence, with the cardinal assertion, the N-ary model (Model I) contains
3,211,920 more triples (~16%) than the SP model (Model III), and without the
cardinal assertion, the N-ary model (Model II) exceeds the SP model (Model IV)
by an even larger margin (4,205,921 triples). The Nanopublication model is the
most verbose in this regard, containing 27,605,782 triples distributed over
8,251,238 graphs.
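
The percentages and differences quoted above follow directly from the reported
triple counts, as this short check illustrates. The dictionary and function
names are ours; the numbers are the ones given in the text.

```python
# Total triples per model, as reported in the text.
TRIPLES = {
    "I": 22_787_218,    # N-ary, with cardinal assertion
    "II": 21_445_348,   # N-ary, without cardinal assertion
    "III": 19_575_298,  # SP, with cardinal assertion
    "IV": 17_239_427,   # SP, without cardinal assertion
    "V": 27_605_782,    # Nanopublication
}


def pct_more(larger, smaller):
    """Relative size increase of `larger` over `smaller`, in percent."""
    return 100.0 * (TRIPLES[larger] - TRIPLES[smaller]) / TRIPLES[smaller]


if __name__ == "__main__":
    print("cardinal assertion, SP:    +%.1f%%" % pct_more("III", "IV"))
    print("cardinal assertion, N-ary: +%.1f%%" % pct_more("I", "II"))
    print("N-ary vs SP, cardinal:     %d triples" % (TRIPLES["I"] - TRIPLES["III"]))
    print("N-ary vs SP, no cardinal:  %d triples" % (TRIPLES["II"] - TRIPLES["IV"]))
```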

We also studied the amount of evidence associated with each relational
assertion to illustrate the degree of redundancy of identical pairwise
relations in the life science domain. We only examined the object property
instances representing the pairwise relations that were created in the SP
models (Model III and Model IV), as the degree of redundancy is the same across
the other RDF models. The total numbers of unique subjects in the SP models
with and without cardinal assertion are 7,654,605 (Model III) and 4,442,685
(Model IV), respectively. The difference between the two numbers (3,211,920) is
the total number of object property instances arbitrarily created for the
cardinal assertions. If there are multiple cases of evidence for a given
assertion, the cardinal assertion variant may reduce the total number of
triples needed to express the same information; however, if there is only one
case of evidence for a given assertion, the cardinal assertion will increase
the total number of triples. Hence, whether the cardinal assertion can reduce
the total number of triples depends on the extent of redundancy of the
identical pairwise relations in the data set. Among the 3,211,920 cardinal
assertions, 2,800,124 (~87%) are associated with only one case of evidence,
238,558 (~7%) with two cases of evidence, 67,088 (~2%) with three cases of
evidence, and 98,625 (~3%) with more than three cases of evidence. The
pairwise relation between PubChem compound CID5694 and NCBI gene GID5465 is
associated with the largest number of cases of evidence (3,096). Although there
were many redundant assertions from multiple data sources, the majority have
only one piece of supporting evidence. Hence, the increase in the total number
of triples was largely attributable to publication assertions.
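
The evidence distribution above can be recomputed from the reported counts.
This is a small sketch with names of our choosing; the figures are those given
in the text. The four reported buckets account for 3,204,395 of the 3,211,920
cardinal assertions.

```python
# Counts reported in the text.
SUBJECTS_MODEL_III = 7_654_605
SUBJECTS_MODEL_IV = 4_442_685
CARDINAL_ASSERTIONS = 3_211_920

# Number of cardinal assertions by number of supporting cases of evidence.
BY_EVIDENCE = {"1": 2_800_124, "2": 238_558, "3": 67_088, ">3": 98_625}


def shares(counts, total):
    """Percentage of `total` that falls in each bucket, to one decimal."""
    return {bucket: round(100.0 * n / total, 1) for bucket, n in counts.items()}


if __name__ == "__main__":
    print(SUBJECTS_MODEL_III - SUBJECTS_MODEL_IV)
    print(shares(BY_EVIDENCE, CARDINAL_ASSERTIONS))
    print(sum(BY_EVIDENCE.values()))
```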

3.2 Query performance evaluation 

We undertook a performance evaluation using three RDF databases (see Table 1).
With Virtuoso, the SP models with and without cardinal assertion (Models III
and IV) largely outperformed the other models. Q1 and Q2 executed roughly
100 times faster on the SP models than on the N-ary models. Although
Model V yielded performance comparable to Models III and IV in Q1, the
additional filtering constraint made it much slower in Q2. In Q4, Models III,
IV, and V performed similarly, about 10 times faster than Model II and
100 times faster than Model I. In general, Virtuoso performed best using the
SP models. With the Stardog RDF store, the N-ary models and the SP models
were comparable in performance, but they always outperformed the
Nanopublication model. In particular, when the aggregated evidence was
considered in Q4, both the N-ary and SP models, with and without cardinal
assertion, executed over 10 times faster than the Nanopublication model. Using
Blazegraph, the Nanopublication model generally outperformed the other models.
In particular, Q1 and Q2 executed over 10 times faster with Model V than with
the other models.

When the provenance metadata were not queried (Q1 and Q2), the models with
cardinal assertion (Models I and III) always performed better than the
corresponding models without cardinal assertion (Models II and IV,
respectively). Hence, if we remove the redundant identical assertions from
various data sources in both the N-ary and SP models, graph traversal-like
queries can be executed much faster. If we think of conjunctive queries (i.e.
graph traversals or inner joins) as performing Cartesian products, the
computational cost grows multiplicatively with the number of matching data
items at each join step. Hence, the redundant pairwise relations cost much
more time than the cardinal assertions in Q1 and Q2. However, when the
provenance restrictions were considered, the models without cardinal assertion
(Models II and IV) usually performed better, except for Q3 of the SP models
executed in Stardog and Q4 of both the N-ary and SP models executed in
Blazegraph. The differences in query performance were usually small, except
for Q4 of the N-ary models executed in Virtuoso and Q3 of both the N-ary and
SP models executed in Blazegraph. So in general, when the provenance
restrictions were considered, the models with and without cardinal assertion
were comparable.
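
The Cartesian-product argument can be illustrated with a small count. The
sketch below assumes a naive join with no deduplication, which real query
engines may well avoid; the function name is ours.

```python
from itertools import product


def join_rows(copies_per_hop):
    """Rows a naive join yields for ONE logical path when hop i of the path
    is stored copies_per_hop[i] times (for example, once per data source)."""
    return sum(1 for _ in product(*(range(k) for k in copies_per_hop)))


if __name__ == "__main__":
    # Three-hop path (chemical-gene, gene-gene, gene-disease), each hop
    # restated by four sources, versus one cardinal assertion per hop.
    print(join_rows([4, 4, 4]), join_rows([1, 1, 1]))
```

With four redundant statements per hop, a single three-hop path expands into
64 intermediate rows, whereas the cardinal assertions yield one.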


Table 1. The average execution time and standard deviation in seconds

Store       Query  Model I              Model II             Model III           Model IV            Model V
Virtuoso    Q1     44.827 (±15.918)     269.854 (±99.266)    0.665* (±0.263)     2.398 (±1.082)      1.283 (±0.212)
Virtuoso    Q2     260.337 (±253.588)   369.635 (±120.482)   0.535* (±0.301)     2.375 (±1.083)      585.52 (±193.382)
Virtuoso    Q3     4.04 (±0.294)        3.069 (±0.161)       3.075 (±0.243)      2.248* (±0.718)     2.287 (±0.049)
Virtuoso    Q4     352.312 (±204.483)   14.994 (±11.587)     2.201 (±0.028)      1.953* (±0.331)     2.531 (±0.054)
Stardog     Q1     1.906* (±0.214)      5.499 (±0.191)       3.354 (±0.783)      16.805 (±13.315)    21.291 (±0.621)
Stardog     Q2     2.45* (±0.262)       6.366 (±0.208)       4.072 (±0.492)      18.398 (±14.383)    184.96 (±90.820)
Stardog     Q3     2.738 (±0.240)       1.291* (±0.068)      3.834 (±0.500)      16.463 (±14.277)    27.537 (±1.602)
Stardog     Q4     14.344 (±1.576)      9.084 (±0.147)       9.575 (±0.751)      8.698* (±0.467)     45.959 (±1.263)
Blazegraph  Q1     11.087 (±5.129)      54.74 (±32.618)      33.597 (±10.515)    41.491 (±11.862)    0.732* (±0.133)
Blazegraph  Q2     10.853 (±4.494)      56.599 (±34.590)     33.469 (±10.292)    41.099 (±12.697)    4.215* (±1.635)
Blazegraph  Q3     10.944 (±4.915)      1.56 (±0.533)        6.05 (±1.284)       0.465* (±0.054)     0.581 (±0.071)
Blazegraph  Q4     83.384 (±1.612)      117.131 (±2.721)     74.505* (±1.570)    89.054 (±1.119)     80.729 (±1.436)

Each cell gives the average execution time followed by the standard deviation
in parentheses; the best (lowest) average in each row is marked with an
asterisk.
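
The comparisons drawn from Table 1 can be reproduced from the mean values with
a few lines of Python. The helper names are ours, and the sketch compares
averages only; it ignores the standard deviations, so small differences should
not be over-interpreted.

```python
MODELS = ["I", "II", "III", "IV", "V"]

# Mean execution times in seconds, transcribed from Table 1.
MEAN_S = {
    "Virtuoso": {
        "Q1": [44.827, 269.854, 0.665, 2.398, 1.283],
        "Q2": [260.337, 369.635, 0.535, 2.375, 585.52],
        "Q3": [4.04, 3.069, 3.075, 2.248, 2.287],
        "Q4": [352.312, 14.994, 2.201, 1.953, 2.531],
    },
    "Stardog": {
        "Q1": [1.906, 5.499, 3.354, 16.805, 21.291],
        "Q2": [2.45, 6.366, 4.072, 18.398, 184.96],
        "Q3": [2.738, 1.291, 3.834, 16.463, 27.537],
        "Q4": [14.344, 9.084, 9.575, 8.698, 45.959],
    },
    "Blazegraph": {
        "Q1": [11.087, 54.74, 33.597, 41.491, 0.732],
        "Q2": [10.853, 56.599, 33.469, 41.099, 4.215],
        "Q3": [10.944, 1.56, 6.05, 0.465, 0.581],
        "Q4": [83.384, 117.131, 74.505, 89.054, 80.729],
    },
}


def fastest_model(store, query):
    """Model with the lowest mean execution time for one store and query."""
    row = MEAN_S[store][query]
    return MODELS[row.index(min(row))]


def speedup(store, query, slower, faster):
    """Ratio of mean times: how many times faster `faster` is than `slower`."""
    row = MEAN_S[store][query]
    return row[MODELS.index(slower)] / row[MODELS.index(faster)]


if __name__ == "__main__":
    for store, queries in MEAN_S.items():
        for query in queries:
            print(store, query, "fastest: Model", fastest_model(store, query))
    print("Virtuoso Q1, Model II vs III: %.0fx" % speedup("Virtuoso", "Q1", "II", "III"))
```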











4 Conclusion 


In this study, we evaluated three existing RDF models and two cardinal assertion
models for representing relations and exposing their provenance metadata. We
examined the effect of each model on overall graph size and query execution time
across three different RDF databases. Since our integrated life science dataset
contained many duplicate assertions, graph traversal can be accomplished
much more efficiently using the cardinal assertion. The redundant assertions
add considerable computational overhead when searching through the integrated
knowledge base for evidence-based hypothesis exploration. Surprisingly, we found
that each RDF store performed best with a different provenance model. It
has been demonstrated in a previous analysis m that SPARQL queries may be
executed in an RDF store-specific manner. Our results lead to a similar
conclusion and may have contentious implications for the standardization of a
provenance model, which should ideally be software-, platform-, and
system-agnostic. A more extensive analysis with larger benchmark datasets and
more query patterns would be helpful in future studies.